Abstract - K-means is a partitional clustering technique that
is well known and widely used for its low computational cost.
However, the performance of the k-means algorithm tends to
be affected by skewed data distributions, i.e., imbalanced data.
It often produces clusters of relatively uniform size even when
the input data contain clusters of varied sizes, which is called
the “uniform effect.” In this paper, we analyze the causes of
this effect and illustrate that it is likely to occur in the k-means
clustering process. As the minority class decreases in size, the
“uniform effect” becomes more evident. To counter the
“uniform effect”, we revisit the well-known k-means algorithm
and provide a general method for properly clustering
imbalanced data. We present Imbalanced K-Means (IKM), a
multi-purpose partitional clustering procedure that minimizes
the clustering sum of squared error criterion, while imposing a
hard sequentiality constraint in the clustering step.

The proposed algorithm consists of a novel oversampling 
technique implemented by removing noisy and weak instances 
from both majority and minority classes and then oversampling 
only novel minority instances. We conduct experiments on
twelve UCI datasets from various application domains, using
five algorithms for comparison on eight evaluation metrics.
Experimental results show the effectiveness of the proposed 
clustering algorithm in clustering balanced and imbalanced 
data. 

Index Terms - Imbalanced data, k-means clustering
algorithms, oversampling, Imbalanced K-Means.


I. Introduction 

Cluster analysis is a well-studied domain in data mining. In
cluster analysis, data are analyzed to find hidden relationships
among objects so that a set of objects can be grouped into
clusters. One of the most popular methods in cluster analysis
is the k-means algorithm. Its popularity and applicability in
real-time applications are due to its simplicity and low
computational cost. Researchers have identified several
factors [1] that may strongly affect k-means clustering
analysis, including high dimensionality [2]-[4], sparseness of
the data [5], noise and outliers in the data [6]-[8], scales of the
data [9]-[12], types of attributes [13], [14], the fuzzy index m
[15]-[18], initial cluster centers [19]-[24], and the number of
clusters [25]-[27]. However, further investigation is needed to
better understand the efficiency of the k-means algorithm with
respect to the data distribution used for analysis.

A good amount of research has been done on class-balanced
data distributions for the performance analysis of the k-means
algorithm. For skewed data, the k-means algorithm tends to
generate poor results because some instances of the majority
class are partitioned into the minority class, which produces
clusters of relatively uniform size even though the input data
contain clusters of non-uniform size. In [28] the authors
define this abnormal behavior of k-means clustering as the
“uniform effect”. It is noteworthy that class imbalance is
emerging as an important issue in cluster analysis, especially
for k-means type algorithms, because many real-world
problems, such as remote sensing [29], pollution detection
[30], risk management [31], fraud detection [32], and
especially medical diagnosis [33]-[36], are class imbalanced.
Furthermore, the rare class with the lowest number of
instances is usually the class of interest from the point of
view of the cluster analysis.

Guha et al. [37] proposed early on the use of multiple
representative points to capture the shape information of the
“natural” clusters with nonspherical shapes [1] and achieved an
improvement in noise robustness over the single-link
algorithm. Liu et al. [38] proposed a multiprototype
clustering algorithm, which applies the k-means algorithm to
discover clusters of arbitrary shapes and sizes. However,
the following problems arise when these algorithms are
applied to imbalanced data in practice. 1) These algorithms
depend on a set of parameters whose tuning is problematic in
practical cases. 2) These algorithms use random sampling to
find cluster centers. However, when data are imbalanced, the
selected samples are more likely to come from the majority
classes than from the minority classes. 3) The number of
clusters k needs to be determined in advance as an input to
these algorithms. In a real dataset, k is usually unknown.
4) The separation measures between subclusters that are
defined by these algorithms cannot effectively identify the
complex boundary between two subclusters. 5) The definition
of clusters in these algorithms is different from that of
k-means. Xiong et al. [33] provided a formal and organized
study of the effect of skewed data distributions on hard
k-means clustering. However, the theoretical analysis is based
only on the hard k-means algorithm. In this paper, the
shortcomings of these approaches are analyzed and a novel
algorithm is proposed.

This paper focuses on clustering problems with binary
(two-class) datasets. The rest of this paper is organized as
follows: Section 2 presents the concept of class imbalance
learning and the uniform effect in the k-means algorithm.
Section 3 presents the main related work on the k-means
clustering algorithm. Section 4 provides a detailed explanation
of the Imbalanced K-Means algorithm. Section 5 presents the
datasets used for experiments. Section 6 presents the
algorithms used for comparison. Section 7 presents the
experimental results. Section 8 draws the conclusions and
points out future research.


II. Class Imbalance Learning 

One of the most popular techniques for alleviating the
problems associated with class imbalance is data sampling.
Data sampling alters the distribution of the training data to
achieve a more balanced training data set. This can be
accomplished in one of two ways: undersampling or
oversampling. Undersampling removes majority class
examples from the training data, while oversampling adds
examples to the minority class. Both techniques can be
performed either randomly or intelligently.
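
As a minimal sketch of the random variants of these two strategies, the following Python fragment balances a toy two-class data set. The function names, the fixed seed and the toy points are illustrative assumptions and are not part of the method proposed in this paper.

```python
import random


def random_undersample(majority, minority, seed=0):
    """Randomly keep only as many majority examples as there are minority ones."""
    rng = random.Random(seed)
    kept = rng.sample(majority, k=min(len(majority), len(minority)))
    return kept, list(minority)


def random_oversample(majority, minority, seed=0):
    """Randomly duplicate minority examples until both classes have equal size."""
    rng = random.Random(seed)
    deficit = max(0, len(majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(deficit)]
    return list(majority), list(minority) + extra


# Toy data: 8 majority points and 2 minority points (hypothetical values).
majority = [(float(i), 0.0) for i in range(8)]
minority = [(0.5, 1.0), (1.5, 1.0)]

under_maj, under_min = random_undersample(majority, minority)
over_maj, over_min = random_oversample(majority, minority)
```

Undersampling discards six of the eight majority points here, whereas oversampling keeps all original data and adds six duplicated minority points.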

The random sampling techniques either duplicate
(oversampling) or remove (undersampling) random
examples from the training data. Synthetic minority
oversampling technique (SMOTE) [2] is a more intelligent
oversampling technique that creates new minority class
examples, rather than duplicating existing ones. Wilson’s
editing (WE) [3] intelligently undersamples data by only
removing examples that are thought to be noisy. In this study,
we investigate the impact of an intelligent oversampling
technique on the performance of clustering algorithms.
While the impacts of noise and imbalance have been
frequently investigated in isolation, their combined impacts
have not received enough attention in research, particularly
with respect to clustering algorithms. To alleviate this
deficiency, we present a comprehensive empirical
investigation of learning from noisy and imbalanced data
using the k-means clustering algorithm.

Finding minority class examples effectively and accurately 
without losing overall performance is the objective of class 
imbalance learning. The fundamental issue to be resolved is 
that the clustering ability of most standard learning algorithms 
is significantly compromised by imbalanced class 
distributions. They often give high overall accuracy, but form 
very specific rules and exhibit poor generalization for the 
small class. In other words, overfitting happens to the 
minority class [6], [36], [37], [38], [39]. Correspondingly, the 
majority class is often overgeneralized. Particular attention is 
necessary for each class. It is important to know whether a
performance improvement happens to both classes or to just
one class alone.

Many algorithms and methods have been proposed to 
ameliorate the effect of class imbalance on the performance of 
learning algorithms. There are three main approaches to these 
methods. 

• Internal approaches acting on the algorithm. These 
approaches modify the learning algorithm to deal with the 
imbalance problem. They can adapt the decision threshold to 
create a bias toward the minority class or introduce costs in
the learning process to compensate for the minority class.


• External approaches acting on the data. These algorithms 
act on the data instead of the learning method. They have the 
advantage of being independent from the classifier used. 
There are two basic approaches: oversampling the minority 
class and undersampling the majority class. 

• Combined approaches that are based on boosting
accounting for the imbalance in the training set. These
methods modify the basic boosting method to account for
minority class underrepresentation in the data set.

There are two principal advantages of choosing sampling over
cost-sensitive methods. First, sampling is more general as it
does not depend on the possibility of adapting a certain
algorithm to work with classification costs. Second, the
learning algorithm is not modified; modifying it can cause
difficulties and add parameters to be tuned.

III. Related Work

In this section, we first review the major research on
clustering in class imbalance learning and explain why we
choose oversampling as our technique in this paper. Then, we
introduce frequently used ensemble methods and evaluation
criteria in class imbalance learning.

In recent years, clustering techniques have received much
attention in a wide range of application areas such as medicine,
engineering, finance and biotechnology. The main intention
of clustering is to group together data that have similar
characteristics. Kaufman and Rousseeuw (1990) referred to
clustering as “the art of finding groups in data”. It is not fair
to declare one clustering method the best, since the success of
a clustering method depends highly on the type of data and on
the way the investigation is carried out for a specific
application. Although many researchers have attempted to
make clustering a purely statistical technique, it is still largely
regarded as an exploratory procedure for finding groups of
similar data.

Haitao Xiang et al. [39] have proposed a local clustering
ensemble learning method based on improved AdaBoost
(LCEM) for rare class analysis. LCEM uses an improved
weight updating mechanism in which the weights of samples
that are invariably correctly classified are reduced, while
those of samples that are partially correctly classified are
increased. The proposed algorithm also performs clustering on
the normal class and produces sub-classes with relatively
balanced sizes. Amuthan Prabakar et al. [40] have proposed a
supervised network anomaly detection algorithm that
combines k-means and the C4.5 decision tree, used for
partitioning and model building on intrusion data.
The proposed method is used to mitigate the Forced
Assignment and Class Dominance problems of the k-means
method.

Fi Xuan et al. [41] have proposed two methods. In the first
method, they apply random sampling of the majority subset to
form multiple balanced datasets for clustering; in the second
method, they observe the clustering partitions of all the
objects in the dataset under the conditions of balance and
imbalance from a different angle. Christos Bouras et al. [42]
have proposed the W-k means clustering algorithm for
application to a corpus of news articles derived from major
news portals. The proposed algorithm is an enhancement of
the standard k-means algorithm that uses external knowledge
to enrich the “bag of words” used prior to the clustering
process and to assist the label generation procedure following it.

P.Y. Mok et al. [43] have proposed a new clustering analysis
method that identifies the desired cluster number and 
produces, at the same time, reliable clustering solutions. It 
first obtains many clustering results from a specific algorithm, 
such as Fuzzy C-Means (FCM), and then integrates these 
different results as a judgment matrix. An iterative 
graph-partitioning process is implemented to identify the 
desired cluster number and the final result. 

Luis A. Leiva et al. [44] have proposed Warped K-Means, a
multi-purpose partitional clustering procedure that
minimizes the sum of squared error criterion, while imposing a
hard sequentiality constraint in the classification step on
datasets that implicitly embed sequential information. The
proposed algorithm is also suitable for online learning, since
the number of centroids can be changed and new instances
can easily be added to the final clustering. M.F. Jiang et al.
[45] have proposed variations of the k-means algorithm that
identify outliers by clustering the data in the initial phase and
then using a minimum spanning tree to identify outliers for
removal.

Jie Cao et al. [46] have proposed a Summation-bAsed
Incremental Learning (SAIL) algorithm for
Information-theoretic K-means (Info-Kmeans), which aims to
cluster high-dimensional data, such as images represented by
the bag-of-features (BOF) model, using the K-means algorithm
with KL-divergence as the distance. Since SAIL is a greedy
scheme, it first selects an instance from the data and assigns it
to the most suitable cluster. Then the objective-function value
and other related variables are updated immediately after the
assignment. The process is repeated until some stopping
criterion is met. One of its shortcomings is the difficulty of
selecting the appropriate cluster for an instance. Max
Mignotte [47] has proposed a new and simple segmentation
method based on the K-means clustering procedure for
application to image segmentation. The proposed approach
addresses the problems of local minima, of working in feature
space without considering spatial constraints, and of the
uniform effect.


IV. Framework of IKM Algorithm

This section presents the proposed algorithm, whose main 
characteristics are depicted in the following sections. Initially, 
the main concepts and principles of k-means are presented. 
Then, the definition of our proposed IKM is introduced in 
detail. 

K-means is one of the simplest unsupervised learning 
algorithms, first proposed by MacQueen in 1967, which has
been used by many researchers to solve some well-known 
clustering problems [48]. The technique classifies a given 
data set into a certain number of clusters (assume k clusters). 


The algorithm first randomly initializes the cluster centers.
The next step is to calculate the distance between an object
and the centroid of each cluster. Next, each point belonging to
a given data set is associated with the nearest center. The
cluster centers are then re-calculated. The process is repeated
with the aim of minimizing an objective function known as the
squared error function, given by:
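
In its standard form, with notation assumed here rather than taken from this paper, the objective is J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2, where c_j is the centroid of cluster C_j. The loop described above can be sketched in Python as follows; the function names, the fixed seed and the toy points are illustrative assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise arithmetic mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def sse(points, centers, labels):
    """Sum of squared errors J for a given assignment."""
    return sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means: random initial centers, assign, re-calculate, repeat."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Associate each point with its nearest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        # Re-calculate each center as the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = mean(members)
    return centers, labels


# Two well-separated toy groups (hypothetical values).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, k=2)
total_sse = sse(points, centers, labels)
```

On balanced, well-separated data such as this, the loop recovers the two groups; the uniform effect discussed earlier concerns the case where one group is much smaller than the other.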
The different components of our new proposed framework are
elaborated in the following subsection.

In the initial stage of our framework, the dataset is passed to a
base algorithm to identify the mostly misclassified instances
in both the majority and minority classes. Misclassified
instances are mostly weak instances, and removing them from
the majority and minority classes does not harm the dataset.
In fact, it improves the quality of the dataset in two ways.
First, removing weak instances from the majority class
reduces the class imbalance to a minor extent. Second,
removing weak instances from the minority class leaves good
instances that can be recursively replicated and hybridized
for oversampling, which is also a goal of the framework.
The mostly misclassified instances are identified by using a
base algorithm; in this case C4.5 [49] is used. C4.5 is one of
the best performing algorithms in the area of supervised
learning. Our approach is classifier independent, i.e., there is
no constraint that the same classifier (in this case C4.5) has to
be used for identifying the mostly misclassified instances.
The framework is introduced here, and interested researchers
are encouraged to vary its components for further
exploration.

In the next phase, the dataset is partitioned into majority and
minority subsets. As we are concentrating on oversampling,
we take the minority data subset for further analysis to
generate synthetic instances.

The minority subset can be further analyzed to find missing or
noisy instances so that they can be eliminated. One way to
find the noisy, borderline and missing-value instances, and so
obtain a pure minority set, is to go through a preprocessing
step.
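
One possible way to realise such a preprocessing step is the nearest-neighbour count used in the IKM listing later in this section, where m' is the number of majority examples among the m nearest neighbours of a minority example. The Python sketch below is only an illustration under stated assumptions: the function names, the labels 0 (majority) and 1 (minority), the toy points, and the choice to group the tie m' = m/2 with the prominent examples are all assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def majority_neighbours(i, data, labels, m, majority_label=0):
    """Return m': how many of the m nearest neighbours of data[i] are majority."""
    others = sorted(
        (sq_dist(data[i], data[j]), labels[j])
        for j in range(len(data)) if j != i
    )
    return sum(1 for _, lab in others[:m] if lab == majority_label)


def categorise_minority(data, labels, m=3, minority_label=1, majority_label=0):
    """Label each minority index as noise, misclassified or prominent."""
    categories = {}
    for i, lab in enumerate(labels):
        if lab != minority_label:
            continue
        m_prime = majority_neighbours(i, data, labels, m, majority_label)
        if m_prime == m:            # every neighbour is a majority example
            categories[i] = "noise"
        elif 2 * m_prime > m:       # more majority than minority neighbours
            categories[i] = "misclassified"
        else:                       # assumption: the tie m' = m/2 lands here
            categories[i] = "prominent"
    return categories


# Toy data (hypothetical): indices 0-4 majority, 5-10 minority.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),
        (5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0),
        (0.4, 0.6),   # minority point buried inside the majority group
        (2.0, 2.0)]   # minority point near the majority boundary
labels = [0] * 5 + [1] * 6
categories = categorise_minority(data, labels, m=3)
```

Only the examples marked as prominent would be passed on to the resampling step; the other two categories are the ones the framework removes.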

The good instances remaining in the minority subset are then
resampled, i.e., both replicated and hybridized instances are
generated. The percentage of synthetic instances generated
ranges from 0 to 100%, depending on the percentage
difference between the majority and minority classes in the
original dataset. A percentage of the synthetic minority
instances generated are replicas of the pure instances, and the
remaining percentage are hybrid synthetic instances generated
by combining two or more instances from the pure minority
subset.
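
The resampling step can be sketched as follows. The paper does not specify how two instances are combined into a hybrid, so the random interpolation between two pure instances used here is an assumption (borrowed from SMOTE-style oversampling), as are the function names, the 50% replica fraction and the toy values. The budget function caps the number of synthetic instances at the size of the pure minority set.

```python
import random


def synthetic_budget(n_pure_minority, n_majority):
    """Number of synthetic instances: close the class gap, capped at 100%
    of the pure minority set."""
    gap = max(0, n_majority - n_pure_minority)
    return min(gap, n_pure_minority)


def oversample_minority(pure_minority, n_synthetic, replica_fraction=0.5, seed=0):
    """Generate replicas first, then hybrids of two pure minority instances.

    Needs at least two pure instances whenever hybrids are requested.
    """
    rng = random.Random(seed)
    n_replicas = int(round(n_synthetic * replica_fraction))
    synthetic = []
    for _ in range(n_replicas):
        # Replica: an exact copy of a randomly chosen pure instance.
        synthetic.append(tuple(rng.choice(pure_minority)))
    for _ in range(n_synthetic - n_replicas):
        # Hybrid (assumed form): a random point on the segment between
        # two distinct pure instances.
        a, b = rng.sample(pure_minority, 2)
        t = rng.random()
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


# Toy example (hypothetical values): 4 pure minority points, 10 majority points.
pure = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
budget = synthetic_budget(len(pure), n_majority=10)
synthetic = oversample_minority(pure, budget)
```

With these values the budget is 4, so two replicas and two hybrids are produced and the minority class grows from 4 to 8 instances.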

The oversampled minority subset and the majority subset are
combined to form an almost balanced dataset, which is
passed to a clustering algorithm. In this case we have used the
k-means clustering algorithm. The imbalanced dataset can be
made balanced or almost balanced, depending on the pure
majority subset generated. The maximum number of synthetic
minority instances generated is limited to 100% of the pure
minority set formed. Our method is expected to be superior to
other oversampling methods since it uses only the pure
instances available in the existing minority set for generating
synthetic instances.

Suppose that the whole training set is T, the minority class is P
and the majority class is N, and

P = {p_1, p_2, ..., p_pnum}, N = {n_1, n_2, ..., n_nnum}

where pnum and nnum are the number of minority and majority
examples. The detailed procedure of IKM is as follows.
Algorithm: IKM

Input: A set of minority class examples P, a set of majority class
examples N, |P| < |N|, and F_j, the feature set, j > 0.

Output: Average Measure {AUC, Precision, F-Measure, TP Rate,
TN Rate}

External Selection Phase

Step 1: For every p_i (i = 1, 2, ..., pnum) in the minority class P, we
calculate its m nearest neighbors from the whole training set T. The
number of majority examples among the m nearest neighbors is
denoted by m' (0 ≤ m' ≤ m).

Step 2: If m/2 < m' < m, namely the number of p_i’s majority nearest
neighbors is larger than the number of its minority ones, p_i is
considered to be easily misclassified and put into a set MISCLASS.

Remove the instances in MISCLASS from the minority set.

Step 3: For every n_i (i = 1, 2, ..., nnum) in the majority class N, we
calculate its m nearest neighbors from the whole training set T. The
number of minority examples among the m nearest neighbors is
denoted by m' (0 ≤ m' ≤ m).

Step 4: If m/2 < m' < m, namely the number of n_i’s minority nearest
neighbors is larger than the number of its majority ones, n_i is
considered to be easily misclassified and put into a set MISCLASS.

Remove the instances in MISCLASS from the majority set.

Step 5: For every p_i’ (i = 1, 2, ..., pnum’) in the minority class P, we
calculate its m nearest neighbors from the whole training set T. The
number of majority examples among the m nearest neighbors is
denoted by m' (0 ≤ m' ≤ m).

If m' = m, i.e., all the m nearest neighbors of p_i’ are majority
examples, p_i’ is considered to be noise, an outlier or a missing-value
instance and is removed.

Step 6: For every p_i” (i = 1, 2, ..., pnum”) in the minority class P, we
calculate its m nearest neighbors from the whole training set T. The
number of majority examples among the m nearest neighbors is
denoted by m' (0 ≤ m' ≤ m).

If 0 ≤ m' < m/2, p_i” is a prominent example and needs to be kept in
the minority set for resampling.

Step 7: The examples in the minority set are the prominent examples
of the minority class P, and we can see that PR ⊆ P. We set

PR = {p'_1, p'_2, ..., p'_dnum}, 0 ≤ dnum ≤ pnum

Step 8: In this step, we generate s × dnum synthetic positive
examples from the PR examples in the minority set, where s is an
integer between 1 and k. A percentage of the synthetic examples
generated are replicas of PR examples and the others are hybrids of
PR examples.

Clustering Phase

Step 1: Select k random instances from the training data subset as
the centroids of the clusters C_1, C_2, ..., C_k.

Step 2: For each training instance X:

a. Compute the Euclidean distance
D(C_i, X), i = 1, ..., k.

b. Find the cluster C_q that is closest to X.

c. Assign X to C_q.

d. Update the centroid of C_q.

(The centroid of a cluster is the arithmetic mean of the instances in
the cluster.)

Step 3: Repeat Step 2 until the centroids of clusters C_1, C_2, ..., C_k
stabilize in terms of the mean-squared error criterion.


Algorithm 1 (IKM) can be explained as follows.

The inputs to the algorithm are the minority class P and the
majority class N, with j features. The output of the algorithm
is the set of average measures, such as AUC, Precision,
F-measure, TP rate and TN rate, produced by the IKM
method. The algorithm is divided into two main phases: the
External Selection Phase and the Clustering Phase. In the
External Selection Phase, the imbalanced dataset is divided
into majority and minority subclasses, and noisy instances
and outliers are detected and removed from both subclasses.
Then the consistent instances in the minority set are
replicated by both synthetic and hybridization techniques. In
the Clustering Phase, the dataset so formed is passed to the
k-means clustering algorithm and the evaluation metrics are
measured.


V. Datasets

In this study, we have considered 12 binary data-sets
collected from the KEEL [50] and UCI [51] machine learning
repository Web sites. They vary widely in their degree of
complexity, number of attributes, number of instances, and
imbalance ratio (the ratio of the size of the majority class to
the size of the minority class). Every data-set has 2 classes,
the number of attributes ranges from 8 to 60, the number of
instances ranges from 155 to 3196, and the imbalance ratio is
up to 3.85. This way, we have different IRs, from low
imbalance to highly imbalanced data-sets. Table 1
summarizes the properties of the selected data-sets: for each
data-set, the serial number (S.no), data-set name, number of
examples (#Ex.), number of attributes (#Atts.), class name of
each class (minority and majority) and the IR. The table is
ordered alphabetically by data-set name.
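
As a small worked example of the imbalance ratio defined above, the following Python fragment computes the IR from a list of class labels. The function name and the label counts (70 versus 30) are hypothetical and are not taken from any of the data-sets in Table 1.

```python
from collections import Counter


def imbalance_ratio(labels):
    """IR = size of the largest class divided by size of the smallest class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical two-class labelling: 70 majority and 30 minority examples.
labels = ["majority"] * 70 + ["minority"] * 30
ir = imbalance_ratio(labels)
```

An IR of 1 corresponds to a perfectly balanced data-set; larger values indicate stronger imbalance.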

Table 1 Summary of benchmark imbalanced datasets

S.no  Datasets     # Ex.  # Atts.  Class (-,+)                   IR
1.    Breast       268    9        (recurrence; no-recurrence)   2.37
2.    Breast_w     699    9        (benign; malignant)           1.90
3.    Colic        368    22       (yes; no)                     1.71
4.    Credit-g     1000   21       (good; bad)                   2.33
5.    Diabetes     768    8        (tested-potv; tested-negtv)   1.87
6.    Heart-c      303    14       (<50, >50_1)                  1.19
7.    Heart-h      294    14       (<50, >50_1)                  1.77
8.    Heart-stat   270    14       (absent, present)             1.25
9.    Hepatitis    155    19       (die; live)                   3.85
10.   Ionosphere   351    34       (b; g)                        1.79
11.   Kr-vs-kp     3196   37       (won; no win)                 1.09
12.   Sonar        208    60       (rock; mine)                  1.15


We have obtained the AUC metric estimates by means of a 
10-fold cross-validation. That is, the data-set was split into ten 
folds, each one containing 10% of the patterns of the dataset. 
For each fold, the algorithm is trained with the examples 
contained in the remaining folds and then tested with the 
current fold. The data partitions used in this paper can be 
found in UCI-dataset repository [52] so that any interested 
researcher can reproduce the experimental study. 
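
The fold construction described above can be sketched as follows. This is a minimal illustration, not the partitioning actually used for the experiments: the function name, the fixed seed and the plain (unstratified) shuffle are assumptions.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# Example with 155 instances, the size of the smallest data-set in Table 1.
splits = list(k_fold_indices(155, k=10))
```

Each instance appears in exactly one test fold, so averaging a metric over the ten folds uses every instance once for testing.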


VI. Comparison of Algorithms and Experimental Setup

This section describes the algorithms used in the experimental 
study and their parameter settings, which are obtained from 
the KEEL [50] and WEKA [51] software tools. Several 
clustering methods have been selected and compared to 
determine whether the proposal is competitive in different 
domains with the other approaches. Algorithms are compared 
on equal terms and without specific settings for each data 
problem. The parameters used for the experimental study in 
all clustering methods are the optimal values from the tenfold 
cross-validation, and they are now detailed. 

Table 2 Experimental settings for standard clustering algorithms

Algorithm      Parameter            Value
K-Means        distance function    Euclidean
               max iterations       500
               number of clusters   2
Density        cluster to wrap      k-means
               minstddev            1.0E-6
FF             number of clusters   2
EM             max iterations       100
               minstddev            1.0E-6
               number of clusters   2
Hierarchical   distance function    Euclidean
               number of clusters   2

K-Means: K-means clustering; Density: Density based clustering;
FF: Farthest First clustering; EM: Expectation Maximization;
Hier: Hierarchical clustering


VII. Experimental Results 

In this section, we carry out the empirical comparison of our 
proposed algorithm with the benchmarks. Our aim is to 
answer several questions about the proposed learning 
algorithms in the scenario of two-class imbalanced problems. 

1) In the first place, we want to analyze which of the
approaches is better able to handle a large number of
imbalanced data-sets with different IRs, i.e., to show which
one is the most robust method.

2) We also want to investigate their improvement with respect
to classic clustering methods and to look into the
appropriateness of their use instead of applying a unique
preprocessing step and training a single method, that is,
whether the trade-off between increased complexity and
enhanced performance is justified. Given the number of
methods in the comparison, we cannot compare all of them
directly at once. On this account, we compare the proposed
algorithm with each of the other algorithms independently.
This methodology gives better insight into the results by
identifying the strengths and limitations of our proposed
method against every compared algorithm.
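
The tables in this section report accuracy, precision, recall (TP rate), F-measure and specificity (TN rate), in addition to AUC. As a reminder of how the first five are related, the following Python sketch computes them from the four confusion counts of a two-class problem, with the minority class taken as positive. The function name and the example counts are hypothetical, and AUC is not covered because it needs scores rather than counts.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard two-class measures from confusion counts (positive = minority)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0          # TP rate
    specificity = tn / (tn + fp) if tn + fp else 0.0     # TN rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_measure": f_measure,
    }


# Hypothetical counts for one test fold of 100 instances.
metrics = binary_metrics(tp=30, fp=10, tn=50, fn=10)
```

Because accuracy is dominated by the majority class on imbalanced data, recall and specificity are reported separately so that an improvement on one class can be distinguished from an improvement on both.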


Table 3 Summary of tenfold cross validation performance for Accuracy on all the datasets

Datasets     K-Means        Density        FF             EM             Hier           IKM
Breast       54.26±10.84*   53.66±10.74*   65.63±9.54o    49.29±8.10i    70.05±E57o     55.78±11.87
Breast_w     95.82±2.26     96.22±2.19o    34.94±6.96*    93.75±2.79*    65.52±0.44*    95.36±2.19
Colic        60J7±11.89t    65.30±10.85*   58.67±051*     66.13±7.11*    63.05±1.131    68.07±9.07
Credit-g     53.36±6.77*    56.37±6.72*    62.41±6.56o    60.60±5.33o    70iM).00o      57.08±5.99
Diabetes     65.42±5.87o    65.60±5.68o    65.16±3.42o    64.67±5.74o    65.11±0.34o    63.13±5.92
Heart-c      ii.mn          so.ms.02o      68.77±10.88o   80.28±7.82o    54.15±2.06*    77.94±9.81
Heart-h      77.82±10.54*   80.04±8.68*    66.70±12.23*   81.01±6.51*    63.95±1.36*    33.36±6.99
Heart-stat   74.39±11.42*   75.35±11.60*   66.85±10.87*   81.78±6.55o    5554U.60*      77.38±11.49
Hepatitis    71.09±12.58*   73.15±12.16*   72.14±12.77*   73.83±10.53*   79.38±2.26o    74.46±11.04
Ionosphere   70.80±6.71*    73.06±6.35*    62.75±6.65*    73.03±6.47o    64.10±1.35*    71.0US.06
Kr-vs-kp     54.72±4.77*    54.19±4.93*    53.37±3.80*    59.98±3.48o    52.13±0.63*    55.04±4.00
Sonar        52.43±10.28    50.12±10.40*   50.94±8.28*    49.59±9.55*    51.73±3.41*    52.73±11.13


Fig. 1 Test results of Accuracy on K-means, Density, FF, EM, Hier 
and IKM for Colic, Heart-h, Kr-vs-kp and Sonar Datasets. 
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Table 4 SinnHiy of tenfold [mi valid ode petforoiflnte for AUC on all ±s datasets 


Table TSuumiarof tenfold crow validation performance for F-neaiiie an all the datasets 


DatasetsK-MeatisDeiMy FF EM Eire IKM 


Breast 

0.5MHMJS=0.103t 

0.574=0.0993 

0500=018* 

0.400=0.007* 

0.552*0.116 

Braast jr0.95fc0.02 1 * 096610.021 * 

0785*0.098. 

0.951=0.022* 

0500=0.000* 

0.953*0.022 

Colic 

0 . 6 M 10 S. 0 . 6 M(# 2 i 

0.570*0.114. 

0691:0.063c 

05(0=0.000* 

0.080=010 

Credit-E 

0J3W).067. OJMO 661 

0.521=0057. 

0.507=0.051* 

0500=0.000* 

0500=0.® 

Diabetes 

0.60S±0.067i 0.«lM06Si 

0.520=0.0441 

0.670=0.070c 

0502=0.000* 

0.625=0.050 

Eari-c 

m=mu unis 

0.633=0111. 

0.304=0.073c 

0500=0.000* 

0.780=01! 

Heart-h 

0.775*0101! 0.795*0Ci96i 

0.614*0.133* 

0.792=0.073* 

0500=0.000* 

0.337=0.070 

Heart- stat 

0.746=0.114. 0.756*0.114. 

0662=0.105. 

O.SMOOSc 

0.564=0.013* 

0.770*0.110 

Hepatitis 

0.753=0.06. 0.7S1±0.122= 

0670=0163. 

0.800=0.101o 

0500=0.000* 

0.758*0.114 

[oLosdiere 

O.7O6=O.030iD.743=O.Oj9c 

0.530=0067. 

0.771=0.053= 

0500=0.000* 

0.700=0.080Er- 

vi-t?0.54t0.(Md* b.mm 

0.531=0 039. 

0.533=0.032c 

0500=0.001* 

0550=0.040' 

Sonar 

0521=0103.0499=0104. 

0.513=0.032. 

0.497*0.096* 

0500=0.000* 

0527=0.111 



H K-means ■ Density 

■FF 

HEM 

u Hier 

M IKM 



Fig. 1 Test results of Accuracy on K-means, Density, FF, EM, Hier 
and IKM for Breast_w, Credit-g, Diabetes, and Sonar Datasets. 


Table tSunmiatyaf tenfold cross valid a ban performance for Precisian an ell the datasets 


Datasets K-Meaas 

Deadly 

FF 

m 

Hire 

KM 

Eraast 0.713=0.090o 

0.709=0.033= 

0.747=0.061 c 

0.707=0.03 Sc 

0.702±0.014c 

0.693=0.105 

Breait_w0.96l40.Q24c 

0.939=0.015= 

0.323=0.071* 

0.993=0.007 c 

0.655=0.004* 

0.936=0.032 

Colic. 0.734=0.107* 

0.321=0.033= 

0.719^0.120* 

0.333=0.069= 

0.630=0.011* 

0.793=0.110 

Credit-g 0.727=0.052= 

0.727=0.053= 

0.712=0.035 c 

074640.036c 

0.700=0.000= 

0.66740.05 S 

Diabetes 0.725=0.047 = 

0.734=0.051c 

0.66540.044c 

0.321=0.072 c 

0.65240.004c 


0.62S=0.04SHaart-c 0.797=0.094= 0.331=0.036= 0.726= 

0.129* 0. 343=0.032 c 

0.526=0.101* 
Beeit-b 0.367=0.034= 

0.764=0.103 

0.3 7 7=0.03 7 c 

0.735=0.119* 

0.352=0.061 c 

0.640=0.014* 

0.334=0.071 

Heart-stat 0.306=0.116c 

0.303=0. 116 c 

0.706=0.113* 

0.339=0.07Sc 

0.549=0.063* 

0.744=0.149 

Hepatitis 0.426=0.150* 

0.453=0.156c 

0.405=0.233* 

0.457=0.136* 

0.000=0.000* 

0.613=0.153 

I on o sphaaO.55 7J. 14 7 *05 73=0. 1 4 5 • 

0.309=0.369* 

0.536=0.063* 

0.000=0.000* 

0.71740.112 

Kj - vs-bO. 561=0.041c 

0559±0.042c 

0.5Sl=0.0S2c 

0.5 79=0.022 c 

0.51240.073* 

0.546=0.040' 

Sonar 0.493=0.133* 

0.460=0.140* 

0.422=0.233* 

0.459=0.103* 

0110=0.193* 

0.530=0.123 


Table 6 Summary of tenfold cross validation performance for Recall on all the datasets

Datasets    K-Means        Density        FF             EM             Hier           IKM
Breast      0.577±0.153 ○  0.579±0.159 ○  0.776±0.137 ○  0.482±0.094 ●  1.000±0.004 ○  0.573±0.169
Breast_w    0.976±0.022    0.953±0.029 ●  0.992±0.033 ○  0.907±0.042 ●  1.000±0.000 ○  0.976±0.021
Colic       0.541±0.231 ●  0.582±0.195 ○  0.635±0.229 ○  0.576±0.100 ○  1.000±0.000 ○  0.561±0.221
Credit-g    0.595±0.108 ○  0.606±0.112 ○  0.778±0.121 ○  0.664±0.031 ○  1.000±0.000 ○  0.579±0.118
Diabetes    0.760±0.091 ○  0.747±0.039 ○  0.957±0.106 ○  0.594±0.075 ●  1.000±0.000 ○  0.742±0.122
Heart-c     0.798±0.124 ●  0.826±0.104 ○  0.733±0.137 ●  0.791±0.113 ●  0.956±0.134 ○  0.811±0.121
Heart-h     0.786±0.171 ●  0.815±0.126 ●  0.810±0.203 ●  0.856±0.073 ●  1.000±0.000 ○  0.872±0.106
Heart-stat  0.724±0.157 ●  0.737±0.169 ●  0.723±0.230 ●  0.842±0.036 ○  0.985±0.122 ○  0.740±0.174
Hepatitis   0.824±0.225 ○  0.865±0.190 ○  0.583±0.319 ●  0.906±0.153 ○  0.000±0.000 ●  0.799±0.194
Ionosphere  0.702±0.187 ●  0.787±0.195 ○  0.185±0.234 ●  0.912±0.074 ○  0.000±0.000 ●  0.732±0.160
Kr-vs-kp    0.624±0.197 ○  0.582±0.199 ●  0.600±0.312 ●  0.847±0.100 ○  0.980±0.141 ○  0.620±0.196
Sonar       0.471±0.201 ●  0.471±0.215 ●  0.524±0.392 ○  0.525±0.194 ○  0.235±0.425 ●  0.519±0.130


Table 7 Summary of tenfold cross validation performance for F-measure on all the datasets

Datasets    K-Means        Density        FF             EM             Hier           IKM
Breast      0.630±0.112 ○  0.627±0.112 ○  0.755±0.087 ○  0.569±0.032 ●  0.825±0.010 ○  0.617±0.129
Breast_w    0.968±0.017 ○  0.971±0.017 ○  0.8??±?.??? ●  0.950±0.024 ●  0.792±0.003 ●  0.955±0.021
Colic       0.603±0.102 ●  0.662±0.137 ○  0.633±0.141 ○  0.673±0.032 ○  0.773±0.003 ○  0.629±0.146
Credit-g    0.649±0.074 ○  0.655±0.074 ○  0.739±?.??? ○  0.700±0.052 ○  0.824±0.000 ○  0.613±0.076
Diabetes    0.739±0.053 ○  0.737±0.050 ○  0.779±0.049 ○  0.685±0.055 ○  0.789±0.003 ○  0.676±0.078
Heart-c     0.793±0.098 ○  0.824±0.031 ○  0.711±0.116 ●  0.811±0.031 ○  0.680±0.130 ●  0.783±0.100
Heart-h     0.810±0.110 ●  0.836±0.079 ●  0.745±0.129 ●  0.852±0.052 ○  0.780±0.010 ●  0.848±0.080
Heart-stat  0.752±0.125 ○  0.761±0.134 ○  0.691±0.144 ●  0.837±0.059 ○  0.705±0.037 ●  0.732±0.143
Hepatitis   0.549±0.157 ●  0.532±0.153 ●  0.451±0.226 ●  0.598±0.131 ●  0.000±0.000 ●  0.679±0.141
Ionosphere  0.617±0.155 ●  0.660±0.157 ●  0.173±0.222 ●  0.711±0.059    0.000±0.000 ●  0.710±0.130
Kr-vs-kp    0.573±0.111 ○  0.553±0.111 ●  0.524±0.133 ●  0.636±0.052 ○  0.672±0.097 ○  0.563±0.109
Sonar       0.462±0.149 ●  0.447±0.159 ●  0.414±0.257 ●  0.430±0.133 ●  0.149±0.270 ●  0.512±0.136

Table 8 Summary of tenfold cross validation performance for Specificity on all the datasets

Datasets    K-Means        Density        FF             EM             Hier           IKM
Breast      0.630±0.112 ○  0.437±0.194 ●  0.373±0.190 ●  0.518±0.132 ●  0.000±0.000 ●  0.531±0.185
Breast_w    0.968±0.017 ○  0.979±0.029 ○  0.578±0.196 ●  0.996±0.012 ○  0.000±0.000 ●  0.931±0.036
Colic       0.603±0.162 ●  0.773±0.157 ●  0.506±0.336 ●  0.807±0.093 ●  0.000±0.000 ●  0.316±0.162
Credit-g    0.649±0.074 ○  0.464±0.149 ●  0.264±0.145 ●  0.470±0.100 ●  0.000±0.000 ●  0.559±0.132
Diabetes    0.739±0.053 ○  0.487±0.155 ●  0.082±0.163 ●  0.746±0.145 ○  0.000±0.000 ●  0.507±0.116
Heart-c     0.793±0.098 ○  0.789±0.122 ○  0.633±0.225 ●  0.817±0.107 ○  0.035±0.184 ●  0.748±0.119
Heart-h     0.810±0.110 ○  0.775±0.193 ●  0.418±0.322 ●  0.728±0.127 ●  0.000±0.000 ●  0.801±0.105
Heart-stat  0.752±0.125 ●  0.775±0.155 ●  0.601±0.220 ●  0.787±0.120 ●  0.019±0.123 ●  0.799±0.149
Hepatitis   0.549±0.157 ○  0.696±0.143 ●  0.757±0.154 ○  0.695±0.125 ●  1.000±0.000 ○  0.716±0.148
Ionosphere  0.617±0.155 ●  0.699±0.110 ○  0.875±0.215 ○  0.629±0.093 ●  1.000±0.000 ○  0.687±0.153
Kr-vs-kp    0.6??±?.??? ○  0.497±0.167 ○  0.461±0.330 ○  0.330±0.057 ●  0.020±0.141 ●  0.480±0.174
Sonar       0.462±0.149 ●  0.528±0.193 ●  0.502±0.357 ●  0.470±0.171 ●  0.768±0.420 ○  0.535±0.136


Table 9 Summary of tenfold cross validation performance for FP Rate on all the datasets

Datasets    K-Means        Density        FF             EM             Hier           IKM
Breast      0.539±0.199 ○  0.563±0.194 ○  0.627±0.190 ○  0.482±0.132 ○  1.000±0.000 ○  0.469±0.185
Breast_w    0.076±0.049 ○  0.021±0.029 ●  0.422±0.196 ○  0.004±0.012 ●  1.000±0.000 ○  0.069±0.036
Colic       0.235±0.240 ●  0.227±0.157 ●  0.494±0.336 ●  0.193±0.093 ●  1.000±0.000 ○  0.314±0.162
Credit-g    0.526±0.143 ○  0.536±0.149 ○  0.736±0.145 ○  0.530±0.100 ○  1.000±0.000 ○  0.441±0.132
Diabetes    0.544±0.145 ○  0.513±0.155 ○  0.918±0.163 ○  0.254±0.145 ●  1.000±0.000 ○  0.493±0.116
Heart-c     0.249±0.143 ●  0.211±0.122 ●  0.367±0.225 ○  0.183±0.107 ●  0.965±0.184 ○  0.252±0.119
Heart-h     0.236±0.190 ○  0.225±0.193 ○  0.582±0.322 ○  0.272±0.127 ○  1.000±0.000 ○  0.199±0.105
Heart-stat  0.231±0.163 ○  0.225±0.155 ○  0.399±0.220 ○  0.213±0.120 ○  0.981±0.123 ○  0.201±0.149
Hepatitis   0.313±0.146 ○  0.304±0.143 ○  0.243±0.154 ●  0.305±0.125 ○  0.000±0.000 ●  0.284±0.148
Ionosphere  0.239±0.110 ●  0.301±0.110 ●  0.125±0.215 ●  0.371±0.093 ○  0.000±0.000 ●  0.313±0.153
Kr-vs-kp    0.536±0.131 ○  0.503±0.167 ●  0.539±0.330 ○  0.670±0.057 ○  0.980±0.141 ○  0.520±0.174
Sonar       0.429±0.195 ●  0.472±0.193 ○  0.498±0.357 ○  0.530±0.171 ○  0.232±0.420 ●  0.465±0.136


Table 10 Summary of tenfold cross validation performance for FN Rate on all the datasets

Datasets    K-Means        Density        FF             EM             Hier           IKM
Breast      0.423±0.153 ●  0.421±0.159 ●  0.224±0.137 ●  0.518±0.094 ○  0.000±0.004 ●  0.427±0.169
Breast_w    0.024±0.022    0.047±0.029 ○  0.008±0.033 ●  0.093±0.042 ○  0.000±0.000 ●  0.024±0.021
Colic       0.459±0.231 ○  0.418±0.195 ●  0.365±0.229 ●  0.424±0.100 ●  0.000±0.000 ●  0.439±0.221
Credit-g    0.405±0.108 ●  0.394±0.112 ●  0.222±0.121 ●  0.336±0.031 ●  0.000±0.000 ●  0.421±0.118
Diabetes    0.240±0.091 ●  0.254±0.039 ●  0.043±0.106 ●  0.406±0.075 ○  0.000±0.000 ●  0.258±0.122
Heart-c     0.202±0.124 ○  0.174±0.104 ●  0.267±0.137 ○  0.209±0.113 ○  0.035±0.134 ●  0.189±0.121
Heart-h     0.214±0.171 ○  0.185±0.126 ○  0.190±0.203 ○  0.144±0.073 ○  0.000±0.000 ●  0.128±0.106
Heart-stat  0.276±0.157 ○  0.263±0.169 ○  0.277±0.230 ○  0.158±0.036 ●  0.015±0.122 ●  0.260±0.174
Hepatitis   0.176±0.223 ●  0.135±0.190 ●  0.417±0.319 ○  0.094±0.153 ●  1.000±0.000 ○  0.201±0.194
Ionosphere  0.298±0.187 ○  0.213±0.195 ●  0.815±0.234 ○  0.088±0.074 ●  1.000±0.000 ○  0.268±0.160
Kr-vs-kp    0.376±0.197 ●  0.418±0.199 ○  0.400±0.312 ○  0.153±0.100 ●  0.020±0.141 ●  0.380±0.196
Sonar       0.529±0.201 ○  0.529±0.215 ○  0.476±0.392 ●  0.475±0.194 ●  0.765±0.425 ○  0.481±0.130
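
Apart from AUC, the measures reported in Tables 3 to 10 are functions of a binary confusion matrix, and two pairs are complementary: FN Rate = 1 - Recall and FP Rate = 1 - Specificity. The following is a minimal sketch of those relations. The helper names `ratio` and `binary_metrics` and the example counts are assumptions made for illustration; they are not taken from the experiments in this paper.

```python
def ratio(numerator, denominator):
    """Safe division: return 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Rate-based measures derived from confusion-matrix counts.

    Hypothetical helper for illustration; 'positive' is whichever
    class the evaluation treats as the class of interest.
    """
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)        # true positive rate
    specificity = ratio(tn, tn + fp)   # true negative (TN) rate
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f_measure": ratio(2 * precision * recall, precision + recall),
        "specificity": specificity,
        "fp_rate": ratio(fp, fp + tn),  # 1 - specificity when fp + tn > 0
        "fn_rate": ratio(fn, fn + tp),  # 1 - recall when fn + tp > 0
    }


# Made-up counts, chosen only to show the complementary pairs.
m = binary_metrics(tp=40, fp=10, tn=45, fn=5)
print(round(m["recall"] + m["fn_rate"], 6))
print(round(m["specificity"] + m["fp_rate"], 6))
```

Because of these identities, each Recall entry and its FN Rate counterpart should sum to one, and likewise for Specificity and FP Rate.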
The clustering evaluations were conducted on twelve widely
used datasets. These real-world multi-dimensional
datasets are used to verify the proposed clustering method.

Tables 3, 4, 5, 6, 7, 8, 9 and 10 report the results for Accuracy,
AUC, Precision, Recall, F-measure, Specificity, FP Rate and
FN Rate, respectively, for all twelve datasets from UCI. A
two-tailed corrected resampled paired t-test [46] is used in
this paper to determine whether the difference between the
cross-validation results of two algorithms is significant.
A difference in accuracy is considered significant when the
p-value is less than 0.05 (confidence level greater than 95%).
The results in the tables show that IKM gives a good
improvement on all the clustering measures.
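
As an illustration of the significance test mentioned above, the sketch below computes the statistic of the corrected resampled paired t-test from per-fold scores. It assumes a single tenfold run (so 9 degrees of freedom and a test-to-train ratio of 1/9); the function name, the per-fold scores and the use of a tabulated critical value instead of an exact p-value are assumptions for this example, not details given in the paper.

```python
import math
from statistics import mean, variance

# Two-tailed 5% critical value of Student's t with 9 degrees of freedom
# (k - 1 for k = 10 folds), taken from standard t tables.
T_CRIT_DF9 = 2.262


def corrected_resampled_t(scores_a, scores_b, test_train_ratio):
    """t statistic of the corrected resampled paired t-test.

    scores_a, scores_b: per-fold scores of two algorithms on the same folds.
    test_train_ratio:   n_test / n_train (1/9 for tenfold cross validation).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(diffs)
    var = variance(diffs)  # sample variance (divides by k - 1)
    if var == 0.0:
        raise ValueError("all fold differences are identical")
    # The extra test_train_ratio term inflates the variance to account for
    # the overlap between the training sets of different folds.
    return mean(diffs) / math.sqrt((1.0 / k + test_train_ratio) * var)


# Made-up per-fold accuracies for two hypothetical algorithms.
algo_a = [0.6, 0.8] * 5
algo_b = [0.5] * 10
t = corrected_resampled_t(algo_a, algo_b, test_train_ratio=1 / 9)
significant = abs(t) > T_CRIT_DF9
print(f"t = {t:.3f}, significant at 0.05: {significant}")
```

Without the `test_train_ratio` term this reduces to the ordinary paired t-test, which tends to overstate significance on cross-validation folds.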

Several reasons support the conclusion reached above.
The first is that the decrease of instances in the majority subset
has contributed to the better performance of
our proposed IKM algorithm. The second is that, although
resampling synthetic instances in the
minority subset is the only way to oversample,
conducting proper exploration-exploitation of prominent
instances in the minority subset is the key to the success of our
algorithm. A further reason is the deletion of noisy instances by
the interpolation mechanism of IKM.

Finally, we can make a global analysis of the results by combining
the results offered by Tables 3 to 10:

• Our proposal, IKM, is the best performing one when
the data sets are of the imbalanced category. We have
considered a complete, competitive set of methods,
and an improvement of results is expected in the
benchmark algorithms, i.e., K-means, Density, FF, EM
and Hier. However, they are not able to outperform
IKM. In this sense, the competitive edge of IKM can
be seen.

The cases where IKM behaves similarly to, or less effectively than,
K-means reflect the unique properties of those datasets, where
there is scope for improvement in the majority subset and not in
the minority subset. IKM mainly focuses on improvements
in the minority subset, which is not effective for datasets with
such properties.


Table 11 Summary of experimental results for Improved K-means

Results     Systems               Wins  Ties  Losses
TN Rate     IKM versus K-means    4     0     8
TN Rate     IKM versus Density    8     0     4
TN Rate     IKM versus FF         9     0     3
TN Rate     IKM versus EM         9     0     3
TN Rate     IKM versus Hier       9     0     3
FP Rate     IKM versus K-means    4     0     8
FP Rate     IKM versus Density    5     0     7
FP Rate     IKM versus FF         3     0     9
FP Rate     IKM versus EM         4     0     8
FP Rate     IKM versus Hier       3     0     9
FN Rate     IKM versus K-means    5     1     6
FN Rate     IKM versus Density    7     0     5
FN Rate     IKM versus FF         6     0     6
FN Rate     IKM versus EM         7     0     5
FN Rate     IKM versus Hier       9     0     3


The summary of experimental results of IKM against all the
compared clustering algorithms is shown in Table 11. The
results show that the proposed IKM clustering algorithm is at
least as effective as, and at times more effective than, the K-means,
Density, FF, EM and Hierarchical clustering algorithms. On
accuracy, IKM compared with K-means wins on 7 datasets,
ties on 4 datasets and loses on only 1 dataset. Compared with
Density-based clustering, IKM wins on 9
datasets and loses on only 3 datasets. Compared with FF,
IKM wins on 8 datasets and loses on 4 datasets.
Compared with EM, IKM wins on 6 datasets and loses
on 6 datasets. Compared with Hierarchical
clustering, IKM wins on 8 datasets and loses on 4 datasets. The
AUC, Precision, Recall, F-measure, TN Rate, FP Rate and FN
Rate measures have also been shown to perform well with respect to
IKM.
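
A win/tie/loss summary of this kind can be tallied from per-dataset scores as sketched below. This is only an illustration: the function `tally`, its tolerance, and the scores are hypothetical, and it compares raw means, whereas the counts in Table 11 also depend on the paired t-test described earlier.

```python
def tally(ikm_scores, other_scores, higher_is_better=True, tol=1e-9):
    """Count (wins, ties, losses) of IKM against one competitor.

    Each list holds one score per dataset, in the same dataset order.
    Set higher_is_better=False for error-type measures.
    """
    wins = ties = losses = 0
    for ikm, other in zip(ikm_scores, other_scores):
        if abs(ikm - other) <= tol:
            ties += 1
        elif (ikm > other) == higher_is_better:
            wins += 1
        else:
            losses += 1
    return wins, ties, losses


# Made-up scores on four datasets, for illustration only.
ikm = [0.80, 0.70, 0.65, 0.90]
other = [0.75, 0.70, 0.60, 0.95]
print(tally(ikm, other))
```

The three counts always sum to the number of datasets, which gives a quick consistency check on each row of such a summary table.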

The strength of our model is that IKM over-samples only the
most prominent examples recursively, thereby strengthening the
minority class. One more point to consider is that our method tries
to remove the most misclassified instances from both the majority
and minority sets. First, the removal of some weak instances
from the majority set will not harm the dataset; in fact, it
reduces the root cause of the class imbalance problem as a
whole by reducing the majority samples by a small proportion.
Second, the removal of weak instances from the minority set
helps in the better generation of synthetic examples of
both the same and hybrid types.
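
The two ideas in this paragraph, removing weak instances from both classes and then oversampling the remaining minority instances, can be sketched as follows. This is not the IKM procedure itself: the nearest-neighbour disagreement rule used to flag weak instances, the single (non-recursive) linear interpolation step, and the names `drop_weak` and `oversample` are all assumptions made for this example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def drop_weak(points, labels, k=3):
    """Keep instances whose label agrees with most of their k nearest neighbours.

    Assumption: 'weak' or noisy is approximated by neighbourhood disagreement.
    """
    kept = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        nearest = sorted(others, key=lambda j: sq_dist(p, points[j]))[:k]
        agree = sum(labels[j] == labels[i] for j in nearest)
        if 2 * agree > k:
            kept.append(i)
    return [points[i] for i in kept], [labels[i] for i in kept]


def oversample(minority, n_new, rng):
    """Create n_new synthetic points by interpolating between minority pairs."""
    synthetic = []
    for _ in range(n_new):
        p, q = rng.sample(minority, 2)
        t = rng.random()
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(p, q)))
    return synthetic


# Toy data: a majority cluster near the origin, a minority cluster near
# (5, 5), and one noisy minority point sitting inside the majority cluster.
majority = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.2, 0.8)]
minority = [(5, 5), (5, 6), (6, 5), (0.4, 0.4)]
points = majority + minority
labels = ["maj"] * len(majority) + ["min"] * len(minority)

clean_points, clean_labels = drop_weak(points, labels)
clean_min = [p for p, c in zip(clean_points, clean_labels) if c == "min"]
n_maj = clean_labels.count("maj")
new_points = oversample(clean_min, n_maj - len(clean_min), random.Random(0))
balanced_min = clean_min + new_points
print(len(balanced_min), n_maj)
```

Filtering before interpolation matters here: if the noisy point were kept, synthetic points could be generated on the line between the two clusters, which is the kind of poor synthetic example the paragraph above warns against.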


Results     Systems               Wins  Ties  Losses
Accuracy    IKM versus K-means    7     4     1
Accuracy    IKM versus Density    9     0     3
Accuracy    IKM versus FF         8     0     4
Accuracy    IKM versus EM         6     0     6
Accuracy    IKM versus Hier       8     0     4
AUC         IKM versus K-means    12    0     0
AUC         IKM versus Density    9     0     3
AUC         IKM versus FF         11    0     1
AUC         IKM versus EM         6     0     6
AUC         IKM versus Hier       12    0     0
Precision   IKM versus K-means    4     0     8
Precision   IKM versus Density    2     0     10
Precision   IKM versus FF         8     0     4
Precision   IKM versus EM         3     0     9
Precision   IKM versus Hier       9     0     3
Recall      IKM versus K-means    6     0     6
Recall      IKM versus Density    5     0     7
Recall      IKM versus FF         6     0     6
Recall      IKM versus EM         5     0     7
Recall      IKM versus Hier       3     0     9
F-measure   IKM versus K-means    5     0     7
F-measure   IKM versus Density    5     0     7
F-measure   IKM versus FF         8     0     4
F-measure   IKM versus EM         4     0     8
F-measure   IKM versus Hier       7     0     5


Finally, we can say that IKM is one of the best alternatives for
handling class imbalance problems effectively. This
experimental study supports the conclusion that a
prominent recursive oversampling approach can improve the
CIL behavior when dealing with imbalanced data-sets, as it
has helped the IKM method to be the best
performing algorithm when compared with four classical and
well-known algorithms: K-means, Density, FF, EM and a
well-established Hierarchical algorithm.

VIII. Conclusion 

In this paper, a novel clustering algorithm for imbalanced
distributed data has been proposed. This method uses a unique
oversampling technique to almost balance the dataset so as to
minimize the “uniform effect” in the clustering process.
Empirical results have shown that IKM considerably reduces
the uniform effect while retaining or improving the clustering
measures when compared with benchmark methods. In fact,
the proposed method may also be useful as a framework for
data sources for better clustering measures.